
Heart disease describes a range of conditions that affect the heart. The category includes blood vessel diseases such as coronary artery disease, heart rhythm problems (arrhythmias), and congenital heart defects, among others.
The term "heart disease" is often used interchangeably with "cardiovascular disease". Cardiovascular disease generally refers to conditions involving narrowed or blocked blood vessels that can lead to a heart attack, chest pain (angina), or stroke. Other heart conditions, such as those affecting the heart's muscle, valves, or rhythm, are also considered forms of heart disease.
Heart disease is one of the leading causes of morbidity and mortality worldwide. Predicting cardiovascular disease is considered one of the most important topics in clinical data analysis. The healthcare industry generates an enormous volume of data, and data mining turns large collections of raw health data into information that can support informed decisions and predictions.
According to a news article, heart disease is the leading cause of death for both women and men. The article states the following:
About 610,000 people die of heart disease in the United States every year, which is 1 in every 4 deaths.
Heart disease is the leading cause of death for both men and women. More than half of the deaths due to heart disease in 2009 were in men.
Coronary artery disease (CAD) is the most common type of heart disease, causing more than 370,000 deaths annually.
Every year, about 735,000 Americans have a heart attack. Of these, 525,000 are a first heart attack and 210,000 occur in people who have already had one.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from sklearn.model_selection import train_test_split
from pandas_profiling import ProfileReport
%matplotlib inline
import plotly.express as px
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc
pd.set_option('display.max_columns', None)
| # | Feature | Description |
|---|---|---|
| 1 | HeartDisease | Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI) |
| 2 | BMI | Body Mass Index (BMI) |
| 3 | Smoking | Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] |
| 4 | AlcoholDrinking | Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) |
| 5 | Stroke | (Ever told) (you had) a stroke? |
| 6 | PhysicalHealth | Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? |
| 7 | MentalHealth | Thinking about your mental health, for how many days during the past 30 days was your mental health not good? |
| 8 | DiffWalking | Do you have serious difficulty walking or climbing stairs? |
| 9 | Sex | Are you male or female? |
| 10 | AgeCategory | Fourteen-level age category |
| 11 | Race | Imputed race/ethnicity value |
| 12 | Diabetic | (Ever told) (you had) diabetes? |
| 13 | PhysicalActivity | Adults who reported doing physical activity or exercise during the past 30 days other than their regular job |
| 14 | GenHealth | Would you say that in general your health is... |
| 15 | SleepTime | On average, how many hours of sleep do you get in a 24-hour period? |
| 16 | Asthma | (Ever told) (you had) asthma? |
| 17 | KidneyDisease | Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease? |
| 18 | SkinCancer | (Ever told) (you had) skin cancer? |
| 19 | HeartDisease_FamilyHistory | Do you have family history of heart disease? |
| 20 | State | US state (residency) |
df = pd.read_csv("C:/Users/victo/Documents/NUCLIOMESTRADO/heart_disease_project_data/heart_disease_data.csv")
df
| | HeartDisease | BMI | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex | AgeCategory | Race | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer | HeartDisease_FamilyHistory | State |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No | 16.60 | Yes | No | No | 3.0 | 30.0 | No | Female | 55-59 | White | Yes | Yes | Very good | 5.0 | Yes | No | Yes | No | MT |
| 1 | No | 20.34 | No | NaN | Yes | 0.0 | 0.0 | No | Female | 80 or older | White | No | Yes | Very good | 7.0 | No | No | No | NaN | VT |
| 2 | No | 26.58 | Yes | NaN | No | 20.0 | 30.0 | No | Male | 65-69 | White | Yes | Yes | Fair | 8.0 | Yes | No | No | NaN | WY |
| 3 | No | 24.21 | No | NaN | No | 0.0 | 0.0 | No | Female | 75-79 | White | No | No | Good | 6.0 | No | No | Yes | No | VT |
| 4 | No | 23.71 | No | No | No | 28.0 | 0.0 | Yes | Female | 40-44 | White | No | Yes | Very good | 8.0 | No | No | No | NaN | DC |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 319790 | Yes | 27.41 | Yes | No | No | 7.0 | 0.0 | Yes | Male | 60-64 | Hispanic | Yes | No | Fair | 6.0 | Yes | No | No | NaN | AZ |
| 319791 | No | 29.84 | Yes | NaN | No | 0.0 | 0.0 | No | Male | 35-39 | Hispanic | No | Yes | Very good | 5.0 | Yes | No | No | No | NH |
| 319792 | No | 24.24 | No | No | No | 0.0 | 0.0 | No | Female | 45-49 | Hispanic | No | Yes | Good | 6.0 | No | No | No | NaN | DE |
| 319793 | No | 32.81 | No | NaN | No | 0.0 | 0.0 | No | Female | 25-29 | Hispanic | No | No | Good | 12.0 | No | No | No | NaN | UT |
| 319794 | No | 46.56 | No | NaN | No | 0.0 | 0.0 | No | Female | 80 or older | Hispanic | No | Yes | Good | 8.0 | No | No | No | NaN | OR |
319795 rows × 20 columns
df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319795 entries, 0 to 319794
Data columns (total 20 columns):
 #   Column                      Non-Null Count   Dtype
---  ------                      --------------   -----
 0   HeartDisease                319795 non-null  object
 1   BMI                         319795 non-null  float64
 2   Smoking                     319795 non-null  object
 3   AlcoholDrinking             212984 non-null  object
 4   Stroke                      318683 non-null  object
 5   PhysicalHealth              319795 non-null  float64
 6   MentalHealth                319795 non-null  float64
 7   DiffWalking                 319795 non-null  object
 8   Sex                         319795 non-null  object
 9   AgeCategory                 319795 non-null  object
 10  Race                        319795 non-null  object
 11  Diabetic                    319795 non-null  object
 12  PhysicalActivity            319795 non-null  object
 13  GenHealth                   319795 non-null  object
 14  SleepTime                   319795 non-null  float64
 15  Asthma                      319795 non-null  object
 16  KidneyDisease               319795 non-null  object
 17  SkinCancer                  319446 non-null  object
 18  HeartDisease_FamilyHistory  35263 non-null   object
 19  State                       319795 non-null  object
dtypes: float64(4), object(16)
memory usage: 48.8+ MB
df.shape
(319795, 20)
df.columns
Index(['HeartDisease', 'BMI', 'Smoking', 'AlcoholDrinking', 'Stroke',
'PhysicalHealth', 'MentalHealth', 'DiffWalking', 'Sex', 'AgeCategory',
'Race', 'Diabetic', 'PhysicalActivity', 'GenHealth', 'SleepTime',
'Asthma', 'KidneyDisease', 'SkinCancer', 'HeartDisease_FamilyHistory',
'State'],
dtype='object')
numeric_features = df.select_dtypes(include=[np.number])
numeric_features.columns
Index(['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime'], dtype='object')
categorical_features = df.select_dtypes(include=[object])
categorical_features.columns
Index(['HeartDisease', 'Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking',
'Sex', 'AgeCategory', 'Race', 'Diabetic', 'PhysicalActivity',
'GenHealth', 'Asthma', 'KidneyDisease', 'SkinCancer',
'HeartDisease_FamilyHistory', 'State'],
dtype='object')
df.duplicated().sum()
254
#drop duplicates
df.drop_duplicates(inplace=True)
df.describe(include=['object']).T
| | count | unique | top | freq |
|---|---|---|---|---|
| HeartDisease | 319541 | 2 | No | 292168 |
| Smoking | 319541 | 2 | No | 187692 |
| AlcoholDrinking | 212788 | 2 | No | 191014 |
| Stroke | 318429 | 2 | No | 306360 |
| DiffWalking | 319541 | 2 | No | 275131 |
| Sex | 319541 | 2 | Female | 167696 |
| AgeCategory | 319541 | 14 | 65-69 | 34108 |
| Race | 319541 | 6 | White | 244962 |
| Diabetic | 319541 | 4 | No | 269403 |
| PhysicalActivity | 319541 | 2 | Yes | 247705 |
| GenHealth | 319541 | 5 | Very good | 113727 |
| Asthma | 319541 | 2 | No | 276671 |
| KidneyDisease | 319541 | 2 | No | 307762 |
| SkinCancer | 319192 | 2 | No | 289377 |
| HeartDisease_FamilyHistory | 35261 | 2 | No | 32006 |
| State | 319541 | 51 | OH | 6417 |
df.isnull().sum()
HeartDisease                       0
BMI                                0
Smoking                            0
AlcoholDrinking               106753
Stroke                          1112
PhysicalHealth                     0
MentalHealth                       0
DiffWalking                        0
Sex                                0
AgeCategory                        0
Race                               0
Diabetic                           0
PhysicalActivity                   0
GenHealth                          0
SleepTime                          0
Asthma                             0
KidneyDisease                      0
SkinCancer                       349
HeartDisease_FamilyHistory    284280
State                              0
dtype: int64
df.isnull().sum()/len(df)*100
HeartDisease                   0.000000
BMI                            0.000000
Smoking                        0.000000
AlcoholDrinking               33.408232
Stroke                         0.347999
PhysicalHealth                 0.000000
MentalHealth                   0.000000
DiffWalking                    0.000000
Sex                            0.000000
AgeCategory                    0.000000
Race                           0.000000
Diabetic                       0.000000
PhysicalActivity               0.000000
GenHealth                      0.000000
SleepTime                      0.000000
Asthma                         0.000000
KidneyDisease                  0.000000
SkinCancer                     0.109219
HeartDisease_FamilyHistory    88.965109
State                          0.000000
dtype: float64
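The two cells above report missing counts and missing percentages separately. As a rough sketch, both can be combined into a single table restricted to the columns that actually contain gaps. The helper name `missing_summary` and the toy frame below are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns with any NaN."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("percent", ascending=False)


# Small illustrative frame (made-up rows, not the real dataset)
toy = pd.DataFrame({
    "BMI": [16.6, 20.34, 26.58, 24.21],
    "AlcoholDrinking": ["No", np.nan, np.nan, np.nan],
    "Stroke": ["No", "Yes", "No", np.nan],
})
print(missing_summary(toy))
```

On the real data, calling `missing_summary(df)` would be expected to list only `HeartDisease_FamilyHistory`, `AlcoholDrinking`, `Stroke`, and `SkinCancer`, since those are the only columns with nulls in the output above.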
import missingno as msno
msno.bar(df)
plt.show()
def categorical_feature_func(df, categorical_features):
    # Draw one count plot per categorical column on a 4x4 grid (16 columns here)
    plt.figure(figsize=(25, 20))
    for i, feature in enumerate(categorical_features, 1):
        plt.subplot(4, 4, i)
        sns.set(palette='Paired')
        sns.set_style("ticks")
        ax = sns.countplot(x=feature, data=df)
        ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
    plt.tight_layout()
    plt.show()
categorical_feature_func(df, categorical_features)
numeric_features = df.select_dtypes(include=[np.number])
plt.figure(figsize = (25,15))
for i, feature in enumerate(numeric_features.columns):
    plt.subplot(2,2,i + 1)
    sns.set(palette='dark')
    sns.set_style("ticks")
    sns.histplot(df[feature],kde=True)
    plt.xlabel(feature)
    plt.ylabel("Count")
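# Encode the target as 1 (Yes) / 0 (No); map() avoids the downcasting warning that replace() raises on recent pandas
df['HeartDisease'] = df['HeartDisease'].map({'Yes': 1, 'No': 0})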
plt.figure(figsize = (10, 6))
# Use a white-grid background
plt.style.use("seaborn-v0_8-whitegrid")
# Proportion of heart disease cases in the data
plt.pie(x=df['HeartDisease'].value_counts(),
        autopct='%1.3f%%',
        labels=df['HeartDisease'].value_counts().index,
        colors=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'],
        wedgeprops={'linewidth': 1, 'edgecolor': 'white'})
# Set the title and legend
plt.title('Proportion of heart disease')
plt.legend()
# Tighten the layout
plt.tight_layout()
# Show the chart
plt.show()
pd.set_option('display.max_rows', None)
df['BMI'].describe()
count    319541.000000
mean         28.328993
std           6.371116
min          12.020000
25%          24.030000
50%          27.340000
75%          31.450000
max         119.000000
Name: BMI, dtype: float64
# Create the scatter plot
fig = px.scatter(df, x="BMI", y="SleepTime")
# Show the plot
fig.show()
# Count missing values in the 'SkinCancer' column
df['SkinCancer'].isna().sum()
349
# Drop rows with missing values in the 'SkinCancer' column
df.dropna(subset=['SkinCancer'], inplace=True)
# Count missing values in the 'Stroke' column
df['Stroke'].isna().sum()
1111
# Drop rows with missing values in the 'Stroke' column
df.dropna(subset=['Stroke'], inplace=True)
import plotly.express as px
def heart_Disease_Func(data, column, count=True):
    # Print a quick summary of one column, then plot its distribution
    unique_values = data[column].unique()
    null_count = data[column].isnull().sum()
    value_counts = data[column].value_counts()
    print(f'Number of unique values: {len(unique_values)}')
    print(f'\nUnique values: {unique_values}')
    print(f'\nNumber of null values: {null_count}')
    print(f'\nCount per value: \n{value_counts}')
    if count:
        # Grouped bars split by the HeartDisease target
        fig = px.histogram(data, x=column, color='HeartDisease', barmode='group')
    else:
        # Plotly Express has no 'kde' marginal; 'box' is one of the supported options
        fig = px.histogram(data, x=column, marginal='box')
    fig.show()
heart_Disease_Func(df, 'HeartDisease_FamilyHistory')
Number of unique values: 3
Unique values: ['No' nan 'Yes']
Number of null values: 282992
Count per value:
No     31856
Yes     3233
Name: HeartDisease_FamilyHistory, dtype: int64
heart_Disease_Func(df, 'SleepTime')
Number of unique values: 24
Unique values: [ 5. 7. 8. 6. 12. 4. 9. 10. 15. 3. 2. 1. 16. 18. 14. 20. 11. 13. 17. 24. 19. 21. 22. 23.]
Number of null values: 0
Count per value:
7.0     97164
8.0     97066
6.0     66388
5.0     19098
9.0     15972
10.0     7762
4.0      7712
12.0     2196
3.0      1981
2.0       786
1.0       544
11.0      415
14.0      243
16.0      236
15.0      189
18.0      102
13.0       95
20.0       64
24.0       30
17.0       21
22.0        9
19.0        3
23.0        3
21.0        2
Name: SleepTime, dtype: int64
heart_Disease_Func(df, 'Race')
Number of unique values: 6
Unique values: ['White' 'Black' 'Asian' 'American Indian/Alaskan Native' 'Other' 'Hispanic']
Number of null values: 0
Count per value:
White                             243825
Hispanic                           27317
Black                              22846
Other                              10887
Asian                               8029
American Indian/Alaskan Native      5177
Name: Race, dtype: int64
heart_Disease_Func(df, 'AgeCategory')
Number of unique values: 14
Unique values: ['55-59' '80 or older' '65-69' '75-79' '40-44' '70-74' '60-64' '50-54' '45-49' '18-24' '35-39' '30-34' '25-29' '0']
Number of null values: 0
Count per value:
65-69          33965
60-64          33485
70-74          30904
55-59          29589
50-54          25210
80 or older    24063
45-49          21662
75-79          21383
18-24          20959
40-44          20900
35-39          20429
30-34          18627
25-29          16847
0                 58
Name: AgeCategory, dtype: int64
heart_Disease_Func(df, 'MentalHealth')
Number of unique values: 31
Unique values: [30. 0. 2. 5. 15. 8. 4. 3. 10. 14. 20. 1. 7. 24. 9. 28. 16. 12. 25. 17. 18. 21. 29. 6. 22. 13. 23. 27. 26. 11. 19.]
Number of null values: 0
Count per value:
0.0     204227
30.0     17297
2.0      16417
5.0      14088
10.0     10464
3.0      10411
15.0      9845
1.0       9242
7.0       5505
20.0      5397
4.0       5355
14.0      2042
25.0      1945
6.0       1504
8.0       1091
12.0       755
28.0       509
21.0       350
29.0       316
18.0       211
9.0        203
16.0       151
17.0       127
27.0       125
13.0       110
22.0        98
11.0        83
23.0        67
24.0        66
26.0        59
19.0        21
Name: MentalHealth, dtype: int64
heart_Disease_Func(df, 'PhysicalHealth')
Number of unique values: 31
Unique values: [ 3. 0. 20. 28. 6. 15. 5. 30. 7. 1. 2. 21. 4. 10. 14. 18. 8. 25. 16. 29. 27. 17. 24. 12. 23. 26. 22. 19. 9. 13. 11.]
Number of null values: 0
Count per value:
0.0     225304
30.0     19424
2.0      14808
1.0      10436
3.0       8586
5.0       7575
10.0      5425
15.0      4990
7.0       4605
4.0       4443
20.0      3197
14.0      2878
6.0       1265
25.0      1156
8.0        919
21.0       626
12.0       604
28.0       444
29.0       203
9.0        180
18.0       167
16.0       135
27.0       124
17.0       109
13.0        91
22.0        89
11.0        84
24.0        67
26.0        66
23.0        46
19.0        35
Name: PhysicalHealth, dtype: int64
df.AgeCategory.value_counts()
65-69          33965
60-64          33485
70-74          30904
55-59          29589
50-54          25210
80 or older    24063
45-49          21662
75-79          21383
18-24          20959
40-44          20900
35-39          20429
30-34          18627
25-29          16847
0                 58
Name: AgeCategory, dtype: int64
df['AgeCategory'].unique()
array(['55-59', '80 or older', '65-69', '75-79', '40-44', '70-74',
'60-64', '50-54', '45-49', '18-24', '35-39', '30-34', '25-29', '0'],
dtype=object)
df.drop(df[df['AgeCategory'] == '0'].index, inplace=True)
# Group the 13 age bands into four broader age groups
df['AgeCategory']=df['AgeCategory'].replace(['18-24','25-29','30-34'],'Young')
df['AgeCategory']=df['AgeCategory'].replace(['35-39','40-44','45-49','50-54'],'Adult')
df['AgeCategory']=df['AgeCategory'].replace(['55-59','60-64','65-69','70-74'],'Senior')
df['AgeCategory']=df['AgeCategory'].replace(['75-79','80 or older'],'Elderly')
# Define the specific order of the groups
order = ['Young', 'Adult', 'Senior', 'Elderly']
# Build a dictionary mapping each group to its ordinal value
mapping = {category: i for i, category in enumerate(order)}
# Apply the ordinal encoding to the 'AgeCategory' column
df['AgeCategory'] = df['AgeCategory'].map(mapping)
df['AgeCategory'].unique()
array([2, 3, 1, 0], dtype=int64)
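The grouping and ordinal encoding above take several `replace` calls followed by a `map`. A possible shortcut, sketched here under the assumption that the same four groups and the same 0 to 3 codes are wanted, is a single dictionary from the original age bands to the final codes. The name `AGE_GROUP_CODE` and the sample series are hypothetical.

```python
import pandas as pd

# Hypothetical lookup: original age band -> ordinal group code (0 = youngest group)
AGE_GROUP_CODE = {
    "18-24": 0, "25-29": 0, "30-34": 0,
    "35-39": 1, "40-44": 1, "45-49": 1, "50-54": 1,
    "55-59": 2, "60-64": 2, "65-69": 2, "70-74": 2,
    "75-79": 3, "80 or older": 3,
}

# Illustrative values, including the invalid '0' band seen in the data
ages = pd.Series(["55-59", "80 or older", "18-24", "40-44", "0"])

# Drop bands that are not in the lookup (such as '0'), then map in one step
valid = ages[ages.isin(list(AGE_GROUP_CODE))]
codes = valid.map(AGE_GROUP_CODE)
print(codes.tolist())
```

Keeping the lookup in one place makes the band boundaries easy to review, and any band missing from the dictionary is filtered out explicitly rather than silently becoming NaN.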
import plotly.express as px
# Group by 'AgeCategory' and 'HeartDisease' and count occurrences
data = df.groupby(['AgeCategory', 'HeartDisease']).size().unstack()
# Set the colors
cores = ['#71AEC2', '#D58989']
# Create the interactive bar chart
fig = px.bar(data_frame=data, x=data.index, y=data.columns, color_discrete_sequence=cores)
# Set the title and axis labels
fig.update_layout(title='Heart disease frequency by age category',
                  xaxis_title='Age categories',
                  yaxis_title='Frequency')
# Show the interactive chart
fig.show()
# Check null values in the AlcoholDrinking column
df['AlcoholDrinking'].value_counts(dropna=False)
No     190078
NaN    106272
Yes     21673
Name: AlcoholDrinking, dtype: int64
# Check unique values in the AlcoholDrinking column
df['AlcoholDrinking'].unique()
array(['No', nan, 'Yes'], dtype=object)
# Fill the null values of AlcoholDrinking with a placeholder category
# (assigning the result back avoids the chained inplace fillna, which recent pandas warns about)
df['AlcoholDrinking'] = df['AlcoholDrinking'].fillna('ZZZ')
df['AlcoholDrinking'].unique()
array(['No', 'ZZZ', 'Yes'], dtype=object)
df['AlcoholDrinking'].value_counts(dropna=False)
No     190078
ZZZ    106272
Yes     21673
Name: AlcoholDrinking, dtype: int64
# Check null values in the HeartDisease_FamilyHistory column
df['HeartDisease_FamilyHistory'].value_counts(dropna=False)
NaN    282938
No      31853
Yes      3232
Name: HeartDisease_FamilyHistory, dtype: int64
# Fill the null values of HeartDisease_FamilyHistory with a placeholder category
df['HeartDisease_FamilyHistory'] = df['HeartDisease_FamilyHistory'].fillna('XXX')
df['HeartDisease_FamilyHistory'].value_counts(dropna=False)
XXX    282938
No      31853
Yes      3232
Name: HeartDisease_FamilyHistory, dtype: int64
# Transposed view of the first rows of df
df.head().T
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| HeartDisease | 0 | 0 | 0 | 0 | 0 |
| BMI | 16.6 | 20.34 | 26.58 | 24.21 | 23.71 |
| Smoking | Yes | No | Yes | No | No |
| AlcoholDrinking | No | ZZZ | ZZZ | ZZZ | No |
| Stroke | No | Yes | No | No | No |
| PhysicalHealth | 3.0 | 0.0 | 20.0 | 0.0 | 28.0 |
| MentalHealth | 30.0 | 0.0 | 30.0 | 0.0 | 0.0 |
| DiffWalking | No | No | No | No | Yes |
| Sex | Female | Female | Male | Female | Female |
| AgeCategory | 2 | 3 | 2 | 3 | 1 |
| Race | White | White | White | White | White |
| Diabetic | Yes | No | Yes | No | No |
| PhysicalActivity | Yes | Yes | Yes | No | Yes |
| GenHealth | Very good | Very good | Fair | Good | Very good |
| SleepTime | 5.0 | 7.0 | 8.0 | 6.0 | 8.0 |
| Asthma | Yes | No | Yes | No | No |
| KidneyDisease | No | No | No | No | No |
| SkinCancer | Yes | No | No | Yes | No |
| HeartDisease_FamilyHistory | No | XXX | XXX | No | XXX |
| State | MT | VT | WY | VT | DC |
df['GenHealth'].unique()
array(['Very good', 'Fair', 'Good', 'Poor', 'Excellent'], dtype=object)
df['GenHealth'].value_counts(dropna=False)
Very good    113183
Good          92693
Excellent     66395
Fair          34506
Poor          11246
Name: GenHealth, dtype: int64
# Define the specific order of the categories
order = ['Poor', 'Fair', 'Good', 'Very good', 'Excellent']
# Build a dictionary mapping each category to its ordinal value
mapping = {category: i for i, category in enumerate(order)}
# Apply the ordinal encoding to the 'GenHealth' column
df['GenHealth'] = df['GenHealth'].map(mapping)
df['GenHealth'].unique()
array([3, 1, 2, 0, 4], dtype=int64)
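The dictionary-based encoding above can also be expressed with an ordered pandas categorical, which is one common alternative rather than the notebook's own method. The sketch below uses a small made-up series and the same category order, and shows that labels outside the declared order receive the code -1 instead of NaN.

```python
import pandas as pd

health_order = ["Poor", "Fair", "Good", "Very good", "Excellent"]

# Illustrative values only (not the real column)
gen_health = pd.Series(["Very good", "Fair", "Good", "Poor", "Excellent"])

# An ordered categorical stores the ranking; .codes gives the integer positions
as_ordered = pd.Categorical(gen_health, categories=health_order, ordered=True)
codes = pd.Series(as_ordered.codes, index=gen_health.index)
print(codes.tolist())

# A label that is not in health_order gets the sentinel code -1
unknown_code = pd.Categorical(["Unknown"], categories=health_order, ordered=True).codes[0]
print(unknown_code)
```

The -1 sentinel makes unexpected labels easy to spot before modelling, whereas `map` with a dictionary would turn them into NaN.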
df.head(1).T
| | 0 |
|---|---|
| HeartDisease | 0 |
| BMI | 16.6 |
| Smoking | Yes |
| AlcoholDrinking | No |
| Stroke | No |
| PhysicalHealth | 3.0 |
| MentalHealth | 30.0 |
| DiffWalking | No |
| Sex | Female |
| AgeCategory | 2 |
| Race | White |
| Diabetic | Yes |
| PhysicalActivity | Yes |
| GenHealth | 3 |
| SleepTime | 5.0 |
| Asthma | Yes |
| KidneyDisease | No |
| SkinCancer | Yes |
| HeartDisease_FamilyHistory | No |
| State | MT |
df.apply(lambda x: x.nunique(), axis=0)
HeartDisease                     2
BMI                           3606
Smoking                          2
AlcoholDrinking                  3
Stroke                           2
PhysicalHealth                  31
MentalHealth                    31
DiffWalking                      2
Sex                              2
AgeCategory                      4
Race                             6
Diabetic                         4
PhysicalActivity                 2
GenHealth                        5
SleepTime                       24
Asthma                           2
KidneyDisease                    2
SkinCancer                       2
HeartDisease_FamilyHistory       3
State                           51
dtype: int64
df['SleepTime'].value_counts(dropna=False)
7.0     97138
8.0     97054
6.0     66379
5.0     19093
9.0     15970
10.0     7761
4.0      7710
12.0     2196
3.0      1980
2.0       786
1.0       544
11.0      415
14.0      243
16.0      236
15.0      189
18.0      102
13.0       95
20.0       64
24.0       30
17.0       21
22.0        9
19.0        3
23.0        3
21.0        2
Name: SleepTime, dtype: int64
# Define the order of the values (unique values in ascending order)
order = sorted(df['SleepTime'].unique())
# Build a dictionary mapping each value to its ordinal rank (1 to 24)
mapping = {value: i+1 for i, value in enumerate(order)}
# Apply the ordinal encoding to 'SleepTime'
# Note: the 24 unique values are exactly 1.0 to 24.0, so this mapping leaves the column unchanged
df['SleepTime'] = df['SleepTime'].replace(mapping)
df['SleepTime'].value_counts(dropna=False)
7.0     97138
8.0     97054
6.0     66379
5.0     19093
9.0     15970
10.0     7761
4.0      7710
12.0     2196
3.0      1980
2.0       786
1.0       544
11.0      415
14.0      243
16.0      236
15.0      189
18.0      102
13.0       95
20.0       64
24.0       30
17.0       21
22.0        9
19.0        3
23.0        3
21.0        2
Name: SleepTime, dtype: int64
df['SleepTime'].unique()
array([ 5., 7., 8., 6., 12., 4., 9., 10., 15., 3., 2., 1., 16.,
18., 14., 20., 11., 13., 17., 24., 19., 21., 22., 23.])
df['PhysicalHealth'].value_counts(dropna=False)
0.0     225263
30.0     19422
2.0      14805
1.0      10432
3.0       8585
5.0       7575
10.0      5423
15.0      4990
7.0       4605
4.0       4443
20.0      3197
14.0      2877
6.0       1265
25.0      1155
8.0        919
21.0       625
12.0       603
28.0       444
29.0       203
9.0        180
18.0       166
16.0       135
27.0       124
17.0       109
13.0        91
22.0        89
11.0        84
24.0        67
26.0        66
23.0        46
19.0        35
Name: PhysicalHealth, dtype: int64
# Convert the "PhysicalHealth" column to int
df['PhysicalHealth'] = df['PhysicalHealth'].astype(int)
df['MentalHealth'].value_counts(dropna=False)
0.0     204191
30.0     17294
2.0      16413
5.0      14084
10.0     10462
3.0      10409
15.0      9842
1.0       9242
7.0       5505
20.0      5396
4.0       5353
14.0      2041
25.0      1945
6.0       1504
8.0       1091
12.0       755
28.0       509
21.0       350
29.0       316
18.0       211
9.0        203
16.0       151
17.0       127
27.0       125
13.0       110
22.0        98
11.0        83
23.0        67
24.0        66
26.0        59
19.0        21
Name: MentalHealth, dtype: int64
# Convert the "MentalHealth" column to int
df['MentalHealth'] = df['MentalHealth'].astype(int)
df.head().T
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| HeartDisease | 0 | 0 | 0 | 0 | 0 |
| BMI | 16.6 | 20.34 | 26.58 | 24.21 | 23.71 |
| Smoking | Yes | No | Yes | No | No |
| AlcoholDrinking | No | ZZZ | ZZZ | ZZZ | No |
| Stroke | No | Yes | No | No | No |
| PhysicalHealth | 3 | 0 | 20 | 0 | 28 |
| MentalHealth | 30 | 0 | 30 | 0 | 0 |
| DiffWalking | No | No | No | No | Yes |
| Sex | Female | Female | Male | Female | Female |
| AgeCategory | 2 | 3 | 2 | 3 | 1 |
| Race | White | White | White | White | White |
| Diabetic | Yes | No | Yes | No | No |
| PhysicalActivity | Yes | Yes | Yes | No | Yes |
| GenHealth | 3 | 3 | 1 | 2 | 3 |
| SleepTime | 5.0 | 7.0 | 8.0 | 6.0 | 8.0 |
| Asthma | Yes | No | Yes | No | No |
| KidneyDisease | No | No | No | No | No |
| SkinCancer | Yes | No | No | Yes | No |
| HeartDisease_FamilyHistory | No | XXX | XXX | No | XXX |
| State | MT | VT | WY | VT | DC |
df.apply(lambda x: x.unique(), axis=0)
HeartDisease                  [0, 1]
BMI                           [16.6, 20.34, 26.58, 24.21, 23.71, 28.87, 21.6...
Smoking                       [Yes, No]
AlcoholDrinking               [No, ZZZ, Yes]
Stroke                        [No, Yes]
PhysicalHealth                [3, 0, 20, 28, 6, 15, 5, 30, 7, 1, 2, 21, 4, 1...
MentalHealth                  [30, 0, 2, 5, 15, 8, 4, 3, 10, 14, 20, 1, 7, 2...
DiffWalking                   [No, Yes]
Sex                           [Female, Male]
AgeCategory                   [2, 3, 1, 0]
Race                          [White, Black, Asian, American Indian/Alaskan ...
Diabetic                      [Yes, No, No, borderline diabetes, Yes (during...
PhysicalActivity              [Yes, No]
GenHealth                     [3, 1, 2, 0, 4]
SleepTime                     [5.0, 7.0, 8.0, 6.0, 12.0, 4.0, 9.0, 10.0, 15....
Asthma                        [Yes, No]
KidneyDisease                 [No, Yes]
SkinCancer                    [Yes, No]
HeartDisease_FamilyHistory    [No, XXX, Yes]
State                         [MT, VT, WY, DC, PA, AK, KY, DE, CA, NM, WI, V...
dtype: object
# Helper that replaces one column with its one-hot encoded indicator columns
def OHE(dataframe, column_name):
    # dtype='uint8' keeps 0/1 indicators on pandas 2.x, where the default dtype is bool
    dummy_dataset = pd.get_dummies(dataframe[column_name], prefix=column_name, dtype='uint8')
    dataframe = pd.concat([dataframe, dummy_dataset], axis=1)
    dataframe.drop(column_name, axis=1, inplace=True)
    del dummy_dataset
    return dataframe
df = OHE(df, 'Smoking')
df['AlcoholDrinking'].value_counts()
No     190078
ZZZ    106272
Yes     21673
Name: AlcoholDrinking, dtype: int64
df = OHE(df, 'AlcoholDrinking')
df['Stroke'].value_counts()
No     305966
Yes     12057
Name: Stroke, dtype: int64
df = OHE(df, 'Stroke')
df['DiffWalking'].value_counts()
No     273791
Yes     44232
Name: DiffWalking, dtype: int64
df = OHE(df, 'DiffWalking')
df['Sex'].value_counts()
Female    166896
Male      151127
Name: Sex, dtype: int64
df = OHE(df, 'Sex')
df['Race'].value_counts(dropna=False)
White                             243783
Hispanic                           27314
Black                              22840
Other                              10884
Asian                               8026
American Indian/Alaskan Native      5176
Name: Race, dtype: int64
df = OHE(df, 'Race')
df = OHE(df, 'Diabetic')
df = OHE(df, 'PhysicalActivity')
df = OHE(df, 'Asthma')
df = OHE(df, 'KidneyDisease')
df = OHE(df, 'SkinCancer')
df = OHE(df, 'HeartDisease_FamilyHistory')
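The repeated `OHE` calls above encode one column at a time. As a hedged alternative, `pd.get_dummies` accepts a `columns` list and can encode several columns in one call. The toy frame below is made up for illustration and is not the notebook's data.

```python
import pandas as pd

# Small illustrative frame (made-up rows)
toy = pd.DataFrame({
    "BMI": [16.6, 20.34, 26.58],
    "Smoking": ["Yes", "No", "Yes"],
    "Sex": ["Female", "Female", "Male"],
})

# Encode both categorical columns at once; untouched columns are kept first
encoded = pd.get_dummies(toy, columns=["Smoking", "Sex"], dtype="uint8")
print(encoded.columns.tolist())
```

For two-level columns such as `Smoking`, passing `drop_first=True` would keep a single indicator per column, which avoids perfectly redundant pairs like `Smoking_No` and `Smoking_Yes`. Whether that is desirable depends on the model being trained.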
df.apply(lambda x: x.nunique(), axis=0)
HeartDisease                              2
BMI                                    3606
PhysicalHealth                           31
MentalHealth                             31
AgeCategory                               4
GenHealth                                 5
SleepTime                                24
State                                    51
Smoking_No                                2
Smoking_Yes                               2
AlcoholDrinking_No                        2
AlcoholDrinking_Yes                       2
AlcoholDrinking_ZZZ                       2
Stroke_No                                 2
Stroke_Yes                                2
DiffWalking_No                            2
DiffWalking_Yes                           2
Sex_Female                                2
Sex_Male                                  2
Race_American Indian/Alaskan Native       2
Race_Asian                                2
Race_Black                                2
Race_Hispanic                             2
Race_Other                                2
Race_White                                2
Diabetic_No                               2
Diabetic_No, borderline diabetes          2
Diabetic_Yes                              2
Diabetic_Yes (during pregnancy)           2
PhysicalActivity_No                       2
PhysicalActivity_Yes                      2
Asthma_No                                 2
Asthma_Yes                                2
KidneyDisease_No                          2
KidneyDisease_Yes                         2
SkinCancer_No                             2
SkinCancer_Yes                            2
HeartDisease_FamilyHistory_No             2
HeartDisease_FamilyHistory_XXX            2
HeartDisease_FamilyHistory_Yes            2
dtype: int64
# Drop the 'State' column from the DataFrame
df.drop('State', axis=1, inplace=True)
# Iterate over the columns to print all unique values of each column in the DataFrame
for column in list(df.columns.values):
    print(column, ':', str(df[column].unique()))
HeartDisease : [0 1]
BMI : [16.6 20.34 26.58 ... 62.42 51.46 46.56]
PhysicalHealth : [ 3 0 20 28 6 15 5 30 7 1 2 21 4 10 14 18 8 25 16 29 27 17 24 12 23 26 22 19 9 13 11]
MentalHealth : [30 0 2 5 15 8 4 3 10 14 20 1 7 24 9 28 16 12 25 17 18 21 29 6 22 13 23 27 26 11 19]
AgeCategory : [2 3 1 0]
GenHealth : [3 1 2 0 4]
SleepTime : [ 5. 7. 8. 6. 12. 4. 9. 10. 15. 3. 2. 1. 16. 18. 14. 20. 11. 13. 17. 24. 19. 21. 22. 23.]
Smoking_No : [0 1]
Smoking_Yes : [1 0]
AlcoholDrinking_No : [1 0]
AlcoholDrinking_Yes : [0 1]
AlcoholDrinking_ZZZ : [0 1]
Stroke_No : [1 0]
Stroke_Yes : [0 1]
DiffWalking_No : [1 0]
DiffWalking_Yes : [0 1]
Sex_Female : [1 0]
Sex_Male : [0 1]
Race_American Indian/Alaskan Native : [0 1]
Race_Asian : [0 1]
Race_Black : [0 1]
Race_Hispanic : [0 1]
Race_Other : [0 1]
Race_White : [1 0]
Diabetic_No : [0 1]
Diabetic_No, borderline diabetes : [0 1]
Diabetic_Yes : [1 0]
Diabetic_Yes (during pregnancy) : [0 1]
PhysicalActivity_No : [0 1]
PhysicalActivity_Yes : [1 0]
Asthma_No : [0 1]
Asthma_Yes : [1 0]
KidneyDisease_No : [1 0]
KidneyDisease_Yes : [0 1]
SkinCancer_No : [0 1]
SkinCancer_Yes : [1 0]
HeartDisease_FamilyHistory_No : [1 0]
HeartDisease_FamilyHistory_XXX : [0 1]
HeartDisease_FamilyHistory_Yes : [0 1]
from sklearn import model_selection # model assessment and model selection strategies
from sklearn import metrics # model evaluation metrics
# development = train + test
dev_df_X = df.drop('HeartDisease', axis=1)
dev_df_y = df[['HeartDisease']]
# validation
# NOTE: as written, the validation set is built from the same rows as the development set,
# so it is not an independent hold-out
val_df_X = df.drop('HeartDisease', axis=1)
val_df_y = df[['HeartDisease']]
dev_df_X.head().T
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| BMI | 16.6 | 20.34 | 26.58 | 24.21 | 23.71 |
| PhysicalHealth | 3.0 | 0.00 | 20.00 | 0.00 | 28.00 |
| MentalHealth | 30.0 | 0.00 | 30.00 | 0.00 | 0.00 |
| AgeCategory | 2.0 | 3.00 | 2.00 | 3.00 | 1.00 |
| GenHealth | 3.0 | 3.00 | 1.00 | 2.00 | 3.00 |
| SleepTime | 5.0 | 7.00 | 8.00 | 6.00 | 8.00 |
| Smoking_No | 0.0 | 1.00 | 0.00 | 1.00 | 1.00 |
| Smoking_Yes | 1.0 | 0.00 | 1.00 | 0.00 | 0.00 |
| AlcoholDrinking_No | 1.0 | 0.00 | 0.00 | 0.00 | 1.00 |
| AlcoholDrinking_Yes | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 |
| AlcoholDrinking_ZZZ | 0.0 | 1.00 | 1.00 | 1.00 | 0.00 |
| Stroke_No | 1.0 | 0.00 | 1.00 | 1.00 | 1.00 |
| Stroke_Yes | 0.0 | 1.00 | 0.00 | 0.00 | 0.00 |
| DiffWalking_No | 1.0 | 1.00 | 1.00 | 1.00 | 0.00 |
| DiffWalking_Yes | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 |
| Sex_Female | 1.0 | 1.00 | 0.00 | 1.00 | 1.00 |
| Sex_Male | 0.0 | 0.00 | 1.00 | 0.00 | 0.00 |
| Race_American Indian/Alaskan Native | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 |
| Race_Asian | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 |
| Race_Black | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 |
| Race_Hispanic | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 |
| Race_Other | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 |
| Race_White | 1.0 | 1.00 | 1.00 | 1.00 | 1.00 |
| Diabetic_No | 0.0 | 1.00 | 0.00 | 1.00 | 1.00 |
| Diabetic_No, borderline diabetes | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 |
| Diabetic_Yes | 1.0 | 0.00 | 1.00 | 0.00 | 0.00 |
| Diabetic_Yes (during pregnancy) | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 |
| PhysicalActivity_No | 0.0 | 0.00 | 0.00 | 1.00 | 0.00 |
| PhysicalActivity_Yes | 1.0 | 1.00 | 1.00 | 0.00 | 1.00 |
| Asthma_No | 0.0 | 1.00 | 0.00 | 1.00 | 1.00 |
| Asthma_Yes | 1.0 | 0.00 | 1.00 | 0.00 | 0.00 |
| KidneyDisease_No | 1.0 | 1.00 | 1.00 | 1.00 | 1.00 |
| KidneyDisease_Yes | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 |
| SkinCancer_No | 0.0 | 1.00 | 1.00 | 0.00 | 1.00 |
| SkinCancer_Yes | 1.0 | 0.00 | 0.00 | 1.00 | 0.00 |
| HeartDisease_FamilyHistory_No | 1.0 | 0.00 | 0.00 | 1.00 | 0.00 |
| HeartDisease_FamilyHistory_XXX | 0.0 | 1.00 | 1.00 | 0.00 | 1.00 |
| HeartDisease_FamilyHistory_Yes | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 |
dev_df_y.head().T
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| HeartDisease | 0 | 0 | 0 | 0 | 0 |
X_train, X_test, y_train, y_test = model_selection.train_test_split(
dev_df_X, # X
dev_df_y, # y
test_size = 0.30,
random_state = 42
)
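The split above is purely random. Because only about 8.6% of rows are positive for `HeartDisease`, one option worth considering is a stratified split, which keeps the class ratio the same in both partitions. The sketch below uses synthetic data (the arrays `X` and `y` are made up) and the `stratify` parameter of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data for illustration: 9% positives
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 90 + [0] * 910)

# stratify=y preserves the positive rate in both the train and test partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
print(y_tr.mean(), y_te.mean())
```

In the notebook, the equivalent change would be passing `stratify=dev_df_y` to the existing call. With a dataset this large the random split already gives similar class rates in both partitions, so the benefit is mainly a guarantee rather than a visible change.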
X_train.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 222616 entries, 171781 to 122586
Columns: 38 entries, BMI to HeartDisease_FamilyHistory_Yes
dtypes: float64(2), int32(2), int64(2), uint8(32)
memory usage: 17.0 MB
X_test.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95407 entries, 304097 to 317388
Columns: 38 entries, BMI to HeartDisease_FamilyHistory_Yes
dtypes: float64(2), int32(2), int64(2), uint8(32)
memory usage: 7.3 MB
X_train.describe().T.head()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| BMI | 222616.0 | 28.335417 | 6.377846 | 12.02 | 24.03 | 27.34 | 31.45 | 119.0 |
| PhysicalHealth | 222616.0 | 3.372188 | 7.952641 | 0.00 | 0.00 | 0.00 | 2.00 | 30.0 |
| MentalHealth | 222616.0 | 3.891683 | 7.944438 | 0.00 | 0.00 | 0.00 | 3.00 | 30.0 |
| AgeCategory | 222616.0 | 1.509339 | 0.943634 | 0.00 | 1.00 | 2.00 | 2.00 | 3.0 |
| GenHealth | 222616.0 | 2.594176 | 1.043680 | 0.00 | 2.00 | 3.00 | 3.00 | 4.0 |
X_test.describe().T.head()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| BMI | 95407.0 | 28.314242 | 6.356699 | 12.02 | 24.0 | 27.32 | 31.45 | 92.53 |
| PhysicalHealth | 95407.0 | 3.380381 | 7.957179 | 0.00 | 0.0 | 0.00 | 2.00 | 30.00 |
| MentalHealth | 95407.0 | 3.922249 | 7.986935 | 0.00 | 0.0 | 0.00 | 3.00 | 30.00 |
| AgeCategory | 95407.0 | 1.513746 | 0.943889 | 0.00 | 1.0 | 2.00 | 2.00 | 3.00 |
| GenHealth | 95407.0 | 2.594317 | 1.041420 | 0.00 | 2.0 | 3.00 | 3.00 | 4.00 |
y_train.describe().T.head()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| HeartDisease | 222616.0 | 0.085744 | 0.279986 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
y_test.describe().T.head()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| HeartDisease | 95407.0 | 0.085633 | 0.279823 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
# Create the Decision Tree classifier with max_depth limited to 4
DT = DecisionTreeClassifier(max_depth=4, random_state=42)
# Train the model
DT.fit(X_train, y_train)
# Predict on the test set
y_pred1 = DT.predict(X_test)
# Compute the model accuracy
accuracy = accuracy_score(y_test, y_pred1)
print("Decision Tree Classifier accuracy: {:.2f}%".format(accuracy * 100))
# Run 5-fold cross-validation on the training set
cv_scores = cross_val_score(DT, X_train, y_train, cv=5)
print("Cross-validation accuracy (mean): {:.2f}%".format(cv_scores.mean() * 100))
# Visualize the decision tree
fig, ax = plt.subplots(figsize=(40, 20))
tree.plot_tree(DT,
               ax=ax,
               fontsize=12,
               proportion=True,
               filled=True,
               feature_names=list(X_train.columns))  # recent scikit-learn expects a list here
plt.show()
Decision Tree Classifier accuracy: 91.47%
Cross-validation accuracy (mean): 91.47%
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, roc_curve, auc
# Flatten the label arrays to one dimension
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)
# Create the KNN model
KNN = KNeighborsClassifier(n_neighbors=3)  # set the desired number of neighbors
# Train the model
KNN.fit(X_train, y_train)
# Predict on the test set
y_pred2 = KNN.predict(X_test)
# Compute the test accuracy
accuracy = accuracy_score(y_test, y_pred2)
print("KNeighbors Classifier accuracy: {:.2f}%".format(accuracy * 100))
# Predict positive-class probabilities on the test set
y_pred_proba = KNN.predict_proba(X_test)[:, 1]
# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - KNeighbors Classifier')
plt.legend(loc="lower right")
plt.show()
KNeighbors Classifier accuracy: 90.04%
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc
# Create the Random Forest classifier
RF = RandomForestClassifier(n_estimators=100)  # set the desired number of estimators
# Train the model
RF.fit(X_train, y_train)
# Predict on the test set
y_pred3 = RF.predict(X_test)
# Compute the test accuracy
accuracy = accuracy_score(y_test, y_pred3)
print("Model accuracy: {:.2f}%".format(accuracy * 100))
# Show the confusion matrix
confusion_mat = confusion_matrix(y_test, y_pred3)
print("Confusion matrix:")
print(confusion_mat)
# Predict positive-class probabilities on the test set
y_pred_proba = RF.predict_proba(X_test)[:, 1]
# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
Model accuracy: 90.41%
Confusion matrix:
[[85279  1958]
 [ 7195   975]]
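The confusion matrix printed above carries more information than the accuracy figure alone. A minimal sketch, assuming scikit-learn's layout (rows are true classes, columns are predicted classes) and using only the four counts from that output, recomputes accuracy and derives precision, recall, and F1 for the positive class:

```python
# Counts copied from the Random Forest confusion matrix output.
# Layout assumption: [[TN, FP], [FN, TP]] as returned by sklearn's confusion_matrix.
tn, fp = 85279, 1958
fn, tp = 7195, 975

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy:  {:.4f}".format(accuracy))
print("precision: {:.4f}".format(precision))
print("recall:    {:.4f}".format(recall))
print("f1:        {:.4f}".format(f1))
```

The recomputed accuracy agrees with the 90.41% reported by the cell, while recall shows that the forest identifies only about 12% of the actual heart disease cases in the test set.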
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_curve, auc
# Create the Gradient Boosting classifier
GB = GradientBoostingClassifier()
# Train the model
GB.fit(X_train, y_train)
# Predict on the test set
y_pred4 = GB.predict(X_test)
# Compute the test accuracy
accuracy = accuracy_score(y_test, y_pred4)
print("Gradient Boosting Classifier accuracy: {:.2f}%".format(accuracy * 100))
# Predict positive-class probabilities on the test set
y_pred_proba = GB.predict_proba(X_test)[:, 1]
# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Gradient Boosting Classifier')
plt.legend(loc="lower right")
plt.show()
Gradient Boosting Classifier accuracy: 91.59%
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_curve, auc
from sklearn.preprocessing import StandardScaler
# Preprocess the data: standardize the features
# Note: this overwrites X_train and X_test with scaled NumPy arrays,
# so every later cell sees the scaled data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Create the logistic regression model
LR = LogisticRegression()
# Train the model on the training data
LR.fit(X_train, y_train)
# Predict on the test data
y_pred5 = LR.predict(X_test)
# Compute the test accuracy
accuracy = accuracy_score(y_test, y_pred5)
print("Accuracy:", accuracy)
# Compute further evaluation metrics
print("Classification report:")
print(classification_report(y_test, y_pred5))
# Predict positive-class probabilities on the test set
y_pred_proba = LR.predict_proba(X_test)[:, 1]
# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Logistic Regression')
plt.legend(loc="lower right")
plt.show()
Accuracy: 0.915603676879055
Classification report:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.92 | 0.99 | 0.96 | 87237 |
| 1 | 0.54 | 0.11 | 0.18 | 8170 |
| accuracy | | | 0.92 | 95407 |
| macro avg | 0.73 | 0.55 | 0.57 | 95407 |
| weighted avg | 0.89 | 0.92 | 0.89 | 95407 |
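The two averaged rows of the report above combine the per-class rows in different ways. The sketch below, using the rounded per-class precision and F1 values from the report (the helper names `macro_avg` and `weighted_avg` are illustrative assumptions), shows that the macro average treats both classes equally while the weighted average is dominated by the much larger class 0. Because the inputs are already rounded to two decimals, recomputed values may differ from the report in the last digit for other columns.

```python
# Per-class values copied from the classification report (rounded to 2 decimals)
support = {0: 87237, 1: 8170}
precision = {0: 0.92, 1: 0.54}
f1 = {0: 0.96, 1: 0.18}


def macro_avg(values):
    """Unweighted mean over classes: each class counts equally."""
    return sum(values.values()) / len(values)


def weighted_avg(values, support):
    """Mean over classes, weighted by the number of true samples per class."""
    total = sum(support.values())
    return sum(values[c] * support[c] for c in values) / total


print("macro precision:    {:.2f}".format(macro_avg(precision)))
print("weighted precision: {:.2f}".format(weighted_avg(precision, support)))
print("macro f1:           {:.2f}".format(macro_avg(f1)))
print("weighted f1:        {:.2f}".format(weighted_avg(f1, support)))
```

The macro F1 of 0.57 is the more telling number here, since it is pulled down by the weak performance on the positive class that the weighted average largely hides.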
(Decision Tree vs KNeighborsClassifier vs Random Forest vs Gradient Boosting vs LogisticRegression)
# Collect the test accuracy of each model in a single table
final_data = pd.DataFrame({'MODEL': ['DT', 'KNN', 'RF', 'GB', 'LR'],
                           'ACC': [accuracy_score(y_test, y_pred1),
                                   accuracy_score(y_test, y_pred2),
                                   accuracy_score(y_test, y_pred3),
                                   accuracy_score(y_test, y_pred4),
                                   accuracy_score(y_test, y_pred5)]})
final_data
| | MODEL | ACC |
|---|---|---|
| 0 | DT | 0.914713 |
| 1 | KNN | 0.900437 |
| 2 | RF | 0.904064 |
| 3 | GB | 0.915950 |
| 4 | LR | 0.915604 |
from sklearn import metrics

RANDOM_STATE = 42
n_estimators = 50
max_depth = 5
# Models to compare, each paired with a short label for the legend
models = [
    ('DT', DecisionTreeClassifier(max_depth=max_depth, random_state=RANDOM_STATE)),
    ('KNN', KNeighborsClassifier()),
    ('RF', RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=RANDOM_STATE)),
    ('GB', GradientBoostingClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=RANDOM_STATE)),
    ('LR', LogisticRegression(random_state=RANDOM_STATE))
]
plt.clf()
# Refit each model, score the test set, and add its ROC curve to one figure
for model_name, model_instance in models:
    model_instance.fit(X_train, np.ravel(y_train))
    predictions = model_instance.predict_proba(X_test)[:, 1]
    auc_score = metrics.roc_auc_score(y_test, predictions)
    print('ROC AUC Score for {}: {}'.format(model_name, auc_score))
    fpr, tpr, _ = metrics.roc_curve(y_test, predictions)
    plt.plot(fpr, tpr, label='ROC Curve for {} - Area: {:.2f}'.format(model_name, auc_score))
# Diagonal reference line for a random classifier
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend(loc="lower right")
plt.title('ROC curve')
plt.show()
ROC AUC Score for DT: 0.8186580594634724
ROC AUC Score for KNN: 0.6969406951439943
ROC AUC Score for RF: 0.8281936835808315
ROC AUC Score for GB: 0.8410159831483136
ROC AUC Score for LR: 0.8384305025425678
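The ROC AUC scores above can be read as a ranking quality: the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The following is a small sketch of that pairwise definition, written in plain Python for illustration; the function name `roc_auc_pairwise` and the toy labels and scores are assumptions made up for this example, not data from the notebook.

```python
def roc_auc_pairwise(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy labels and scores, invented for illustration only
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
print(roc_auc_pairwise(y_true, scores))
```

In the toy data, 6 of the 9 positive-negative pairs are ordered correctly, so the result is 6/9. The double loop is quadratic in the number of samples, so it suits small examples only; for the full test set, `roc_auc_score` from scikit-learn remains the practical choice.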
#%%time
#profile = ProfileReport(df, title="Pandas Profiling Report")
#profile.to_file("reports/EDA.html")